Tracing a Loose Wordhood for Chinese Input Method Engine
نویسندگان
چکیده
Chinese input methods are used to convert pinyin sequence or other Latin encoding systems into Chinese character sentences. For more effective pinyin-to-character conversion, typical Input Method Engines (IMEs) rely on a predefined vocabulary that demands manually maintenance on schedule. For the purpose of removing the inconvenient vocabulary setting, this work focuses on automatic wordhood acquisition by fully considering that Chinese inputting is a free human-computer interaction procedure. Instead of strictly defining words, a loose word likelihood is introduced for measuring how likely a character sequence can be a user-recognized word with respect to using IME. Then an online algorithm is proposed to adjust the word likelihood or generate new words by comparing user true choice for inputting and the algorithm prediction. The experimental results show that the proposed solution can agilely adapt to diverse typings and demonstrate performance approaching highly-optimized IME with fixed vocabulary.
منابع مشابه
Chinese Pinyin Input Method for Mobile Phone
Chinese input method is one of the most difficult problems in Chinese Language Processing. And to input Chinese word in mobile phone effectively is an even bigger challenge. In this paper, we propose a new Chinese pinyin input method in mobile phone. This method uses a compact statistical bigram based language model. Also, to meet the special requirements of Chinese pinyin input in mobile phone...
متن کاملGrowing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach
Motivated by Google Sets, we study the problem of growing related words from a single seed word by leveraging user behaviors hiding in user records of Chinese input method. Our proposed method is motivated by the observation that the more frequently two words cooccur in user records, the more related they are. First, we utilize user behaviors to generate candidate words. Then, we utilize search...
متن کاملKySS 1.0: a Framework for Automatic Evaluation of Chinese Input Method Engines
Chinese Input Method Engine (IME) plays an important role in Chinese language processing. However, it has been subjected to lacking a proper evaluation metric for a long time. The natural metric for IME is user experience, which is a rather vague goal for research purpose. We propose a novel approach of quantifying user experience by using keystroke count and then correspondingly develop a fram...
متن کاملWhy Chinese Web-as-Corpus is Wacky? Or: How Big Data is Killing Chinese Corpus Linguistics
This paper aims to examine and evaluate the current development of using Web-as-Corpus (WaC) paradigm in Chinese corpus linguistics. I will argue that the unstable notion of wordhood in Chinese and the resulting diverse ideas of implementing word segmentation systems have posed great challenges for those who are keen on building web-scaled corpus data. Two lexical measures are proposed to illus...
متن کاملA Novel Fuzzy Chinese Address Matching Engine Based on Full-text Search Technology
The ability to locate addresses is one of the most important features in an urban geographic information system. Since Chinese geocoding problem cannot be handled by the European geocoding method, some Chinese scholars did special researches on Chinese geocoding. Current researches all focus on address standardizations and models, and pay less attention to the user input and result control. We ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1712.04158 شماره
صفحات -
تاریخ انتشار 2017